Homework 09: Networks, Variable Visualization, and New 1-D Graph Critiques

General instructions for all assignments:

  • Use this file as the template for your submission. Delete the unnecessary text (e.g. this text, the problem statements, etc). That said, keep the nicely formatted “Problem 1”, “Problem 2”, “a.”, “b.”, etc.
  • Upload a single R Markdown file (named as: [AndrewID]-315-HW09.Rmd – e.g. “mneykov-315-HW09.Rmd”) to the Homework 09 submission section on Canvas. You do not need to upload the .html file.
  • The instructor and TAs will run your .Rmd file on their computers. If your .Rmd file does not knit on our computers, you will automatically be deducted 10 points.
  • Your file should contain the code to answer each question in its own code block. Your code should produce plots/output that will be automatically embedded in the output (.html) file
  • Each answer must be supported by written statements (unless otherwise specified)
  • Include the name of anyone you collaborated with at the top of the assignment
  • Include the style guide you used below under Problem 0


Important Note – This Is A Group Assignment!

You should complete HW09 as a group. Only one submission is needed per group. Groups that submit multiple assignments may lose points at the instructors’ discretion.

Problem 0

  1. THEME:
library(tidyverse)
achidamb_315_theme <- theme_bw() +  # White background, black and white theme
  theme(axis.text = element_text(size = 10, color = "navy", family = "serif"),
        text = element_text(size = 14, face = "bold", color = "navy"))

COLOR:

achidamb_color_palette <- c("#2D3184","#0082A6", "#4EBBB9", "#9CDFC2", "#D8F0CD","#F3F1E4")
  1. I am using Hadley Wickham’s Advanced R Style Guide for this assignment.


Problem 1

(2 points each)

Parallel Coordinates and Radar Charts

There are no standard ggplot() geometries for creating parallel coordinates plots or radar charts, but there is an implementation in the GGally package.

  1. Create a parallel coordinates chart displaying the continuous variables in the Cars93 dataset. Color the lines by the Type of car. Code is partially completed for you below. Be sure to rotate the x-axis labels, update the legend, and add titles/axis labels:
library(MASS)
library(tidyverse)
library(GGally)
data(Cars93)
# Column indices of the continuous variables ("Cars93" is not a column name, so it is dropped)
cont_cols <- which(names(Cars93) %in% 
                     c("Price", "MPG.city", "MPG.highway", "EngineSize",
                       "Horsepower", "RPM", "Fuel.tank.capacity", "Passengers",
                       "Length", "Wheelbase", "Width", "Turn.circle", "Weight"))

# labs() takes x and y (not xlab/ylab); x is the variable axis, y is the value axis
ggparcoord(Cars93, columns = cont_cols) + aes(color = factor(Type)) + coord_flip() +
  labs(title = "Value vs Variables by Type of Car", x = "Variables", y = "Value",
       color = "Car Type") +
  achidamb_315_theme

  1. Car type 4 (Small) gets better mileage in MPG.highway and MPG.city than the other car types; its standardized values on these variables are around 4. Car type 6 (Van) fits the most passengers, with a standardized value close to 3 in the graph.

  2. Repeat part (a), but create a radar chart instead. To do this, simply add + coord_polar() to your parallel coordinates code. Which plot is easier to read?

# coord_polar() replaces any earlier coord_flip(), so only coord_polar() is kept
ggparcoord(Cars93, columns = cont_cols) + aes(color = factor(Type)) + 
  labs(title = "Value vs Variables by Type of Car", x = "Variables", y = "Value",
       color = "Car Type") + coord_polar() +
  achidamb_315_theme

  1. What is the default y-axis in this implementation of parallel coordinates charts? (Hint: Look at the scale parameter.) What could you change the scale parameter to in order to mimic the way parallel coordinates charts were introduced in class? Do this, and create a new graph showing the result.

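The default is scale = "std", which standardizes each variable (subtract the mean, then divide by the standard deviation). The y-axis is therefore in standard-deviation units, which lets us compare how far each car type sits from the average on each variable. The setting that matches our introduction in class is scale = "uniminmax", which rescales each variable to run from 0 to 1; we also didn’t flip the coordinates this time.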

ggparcoord(Cars93, columns = cont_cols, scale = "uniminmax") + aes(color = factor(Type)) +  
  labs(title = "Value vs Variables by Type of Car", x = "Variables", y = "Value",
       color = "Car Type") +
  achidamb_315_theme
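
To make the difference between the two scale settings concrete, here is a minimal base-R sketch of what each one does to a single variable. The vector x is made up for illustration, and the formulas follow the documented meaning of "std" and "uniminmax" rather than GGally's internal code.

```r
# Hypothetical values for one variable (not taken from Cars93)
x <- c(10, 20, 30, 40, 50)

# scale = "std": subtract the mean, then divide by the standard deviation
x_std <- (x - mean(x)) / sd(x)

# scale = "uniminmax": rescale so the minimum is 0 and the maximum is 1
x_minmax <- (x - min(x)) / (max(x) - min(x))

print(round(x_std, 2))
print(x_minmax)
```

With "std" the axis is centered at 0 and unbounded, while with "uniminmax" every variable spans exactly the range from 0 to 1.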

  1. Are any adjacent pairs of variables in your graph from part (d) positively correlated? Are any adjacent pairs of variables negatively correlated? Answer this using your parallel coordinates plot, and explain how you obtained this answer. (It may help to wait until Monday’s lecture to answer this.)

There aren’t any two types that are strongly positively correlated. The two most positively correlated pairs of car types are probably (1,3) and (2,3): the type 3 lines usually fall between the type 2 and type 1 lines, and they follow a similar pattern to slightly different degrees. The two most negatively correlated pairs are types (4,6) and (2,4). Type 4 mostly takes values opposite to types 2 and 6; for example, on the last five variables, type 4 cars are mostly at the bottom with values around 0, while type 2 and 6 cars are at the top with values close to 1.
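
Since the question asks about adjacent pairs of variables, one way to cross-check what the plot suggests is to compute the correlation between each variable and its neighbor on the axis. The sketch below assumes the axis order matches the column order in Cars93 (which is what which() returns for cont_cols). In a parallel coordinates plot, mostly parallel line segments between two neighboring axes suggest a positive correlation, and heavily crossing segments suggest a negative one.

```r
library(MASS)  # provides Cars93

# Same continuous variables, in the order they appear along the plot's axis
vars <- c("Price", "MPG.city", "MPG.highway", "EngineSize", "Horsepower",
          "RPM", "Fuel.tank.capacity", "Passengers", "Length", "Wheelbase",
          "Width", "Turn.circle", "Weight")
cars_adj <- Cars93[, vars]

# Correlation between each variable and its right-hand neighbor
adjacent_cor <- sapply(seq_len(length(vars) - 1), function(i) {
  cor(cars_adj[[i]], cars_adj[[i + 1]])
})
names(adjacent_cor) <- paste(vars[-length(vars)], vars[-1], sep = " ~ ")

print(round(adjacent_cor, 2))
```

Pairs with values near 1 should look parallel in the plot, and pairs with values near -1 should show many crossings.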



Problem 2

cars_cont <- dplyr::select(Cars93, Price, MPG.city, MPG.highway, EngineSize, 
                           Horsepower, RPM, Fuel.tank.capacity, Passengers,
                           Length, Wheelbase, Width, Turn.circle, Weight)
library(reshape2)
correlation_matrix <- cor(cars_cont)
melted_cormat <- melt(correlation_matrix)
ggplot(data = melted_cormat, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile()+labs(title = 'Cars Correlation Heat Map', x = '', y = '') + 
  scale_fill_gradient2(low = "dark red", high = "dark blue",   
                       mid =  "light grey", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Correlation") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

  • We notice that Weight and Turn.circle, Weight and Width, Weight and Wheelbase, Weight and Fuel.tank.capacity, EngineSize and Width, and MPG.city and MPG.highway are highly positively correlated pairs of variables.
  • We notice that MPG.city and Weight, MPG.highway and Weight, MPG.city and Fuel.tank.capacity, and MPG.highway and Fuel.tank.capacity are highly negatively correlated pairs of variables.
  • We notice that Passengers and Price, RPM and Price, Horsepower and Passengers, and Horsepower and RPM are pairs of variables with little to no correlation.

The plot above is a heat map. It is a graphical representation of the correlation between every pair of the selected variables, and it uses a color gradient to show the sign and strength of each correlation. Because it is drawn from the full correlation matrix, every possible pair of selected variables from the dataset appears in it.
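
For readers unfamiliar with what melt() does to the correlation matrix before plotting, here is a small base-R sketch of the same reshaping idea. It uses as.data.frame(as.table()) as a stand-in for reshape2::melt(), and the three-variable subset is chosen only to keep the output short.

```r
library(MASS)  # provides Cars93

# Small subset of the continuous variables, chosen only for illustration
small <- Cars93[, c("Price", "MPG.city", "Weight")]
cormat <- cor(small)

# Long format: one row per (Var1, Var2) pair, similar to the output of melt()
long <- as.data.frame(as.table(cormat), responseName = "value")
print(long)
```

Each row of the long data frame corresponds to one tile in the heat map, with value mapped to the fill color.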

  1. This reminds me a lot of mosaic plots.

# Taken from guide
reorder_cormat <- function(cormat) {
  # Use correlation between variables as distance
  dd <- as.dist((1 - cormat) / 2)
  hc <- hclust(dd)
  # Return the matrix with rows and columns in clustering order
  cormat[hc$order, hc$order]
}


get_upper_tri <- function(cormat) {
  # Blank out the lower triangle so each pair appears once
  cormat[lower.tri(cormat)] <- NA
  cormat
}

correlation_matrix <- cor(cars_cont)
correlation_matrix <- reorder_cormat(correlation_matrix)
correlation_matrix <- get_upper_tri(correlation_matrix)
melted_cormat <- melt(correlation_matrix, na.rm=TRUE)
ggplot(data = melted_cormat, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + labs(title = 'Cars Correlation Heat Map', x = '', y = '') + 
  scale_fill_gradient2(low = "dark red", high = "dark blue",   
                       mid =  "light grey", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Correlation") + 
  geom_text(aes(Var1, Var2, label=sprintf("%0.2f", round(value, digits = 2))), 
            color = "green", size = 6) +
  theme(text = element_text(size=15), 
        axis.text.x = element_text(angle = 90, hjust = 1))



Problem 3

(20 points)

Variable Dendrograms

Another way to visually explore potential associations between continuous variables in our dataset is with dendrograms.

  1. (15 points) Create a “variable dendrogram” of the continuous variables in the Cars93 dataset. To do this:
  • Select the continuous variables from the dataset
  • Compute the correlation matrix for these variables
  • Correlations measure similarity and can be negative, while distances measure dissimilarity and cannot be negative. As such, convert your correlations to instead be one minus the absolute value of the correlations, so that correlations near 1 or -1 will have distances of 0, and correlations near 0 will have distances of 1, e.g.: cormat <- 1 - abs(cormat)
  • Convert your transformed correlation matrix to a distance matrix with the as.dist() function.
  • Submit this distance matrix to hierarchical clustering (hclust()), convert the result to a dendrogram (as.dendrogram()), then plot with ggplot().
  • Color the branches by the four-cluster solution. See the link in HW08 for how to do this.
  • Be sure to adjust your axis labels, add a title, etc.
  • The resulting dendrogram should plot highly correlated variables (positively or negatively correlated) in the same branches / clusters in the dendrogram, while uncorrelated variables will be linked at higher “distances” on the dendrogram.
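
The steps above can be sketched roughly as follows. This is one possible outline, not a reference solution: the clustering uses hclust()'s default (complete) linkage because the problem does not name one, and the plotting step assumes the dendextend and ggplot2 packages are installed, so it is skipped otherwise.

```r
library(MASS)  # provides Cars93

vars <- c("Price", "MPG.city", "MPG.highway", "EngineSize", "Horsepower",
          "RPM", "Fuel.tank.capacity", "Passengers", "Length", "Wheelbase",
          "Width", "Turn.circle", "Weight")

# Distance between variables: 0 when |correlation| is 1, 1 when correlation is 0
var_dist <- as.dist(1 - abs(cor(Cars93[, vars])))

# Hierarchical clustering, then conversion to a dendrogram
var_hc <- hclust(var_dist)
var_dend <- as.dendrogram(var_hc)

# Cluster membership for the four-cluster solution
clusters <- cutree(var_hc, k = 4)
print(split(names(clusters), clusters))

# Optional plotting step (assumes dendextend and ggplot2 are available)
if (requireNamespace("dendextend", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  colored <- dendextend::color_branches(var_dend, k = 4)
  p <- ggplot2::ggplot(dendextend::as.ggdend(colored)) +
    ggplot2::labs(title = "Variable Dendrogram of Cars93")
  print(p)
}
```

The printed list from split() is a quick way to read off which variables share a cluster before interpreting the plot.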
  1. (5 points) Examine the four-cluster solution. Which variables are in the same cluster? Does it make sense that these are in the same cluster, given both your common-sense understanding of these variables and given the correlation plot you created in Problem 2?

  2. (1 point) What other measures (other than correlation) could you use to measure similarity / dissimilarity between continuous variables for the purposes of a variable dendrogram? (There is not necessarily a right or wrong answer here – just brainstorm ideas.)

If you’re finding your graphic runs over the boundaries, try standard approaches of ylim and xlim changes. Additionally, sometimes the knitted file looks different - knit once before you run.



Problem 4

(2 points each)

Love Actually Character Network

  1. Read this article from FiveThirtyEight. Write 2-3 sentences summarizing any methods of analysis that they used.

  2. Load the Love Actually adjacency matrix from FiveThirtyEight’s GitHub Page. Store this in an object called love_adjacency. Convert this into a distance matrix, using \(1/(1+x)\) as a conversion function between the adjacencies and the distances. Use hierarchical clustering with average linkage (method = "average" in hclust()) and convert the result to a dendrogram. Visualize this with ggplot(), add appropriate titles/labels/themes/etc. (Code is partially provided to do this.)

library(dendextend)
love_adjacency <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/love-actually/love_actually_adjacencies.csv")
love_dist <- 1 / (1 + as.dist(love_adjacency[,-1]))
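
To see what the 1/(1+x) conversion does before applying it to the full dataset, here is a toy sketch with a made-up adjacency matrix. The three characters A, B, and C and their counts are hypothetical, not taken from the FiveThirtyEight data.

```r
# Hypothetical co-appearance counts for three made-up characters
adj <- matrix(c(0, 4, 0,
                4, 0, 1,
                0, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# More shared scenes give a smaller distance; no shared scenes give distance 1
toy_dist <- as.dist(1 / (1 + adj))
print(toy_dist)

# Average-linkage clustering, then conversion to a dendrogram
toy_hc <- hclust(toy_dist, method = "average")
toy_dend <- as.dendrogram(toy_hc)
print(toy_hc$height)
```

Here A and B share the most scenes, so they merge first at distance 0.2; C then joins them at the average of its distances to A and B.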
  1. Interpret the resulting dendrogram. Which characters are connected in the movie?

  2. Read about the ggraph package here. What does it do? When was it released? What ggplot() like function can you do with ggraph?

  3. Read this post on adjusting the edges in ggraph. How would you create a dendrogram with the ggraph package? Use the example code at this link to create a dendrogram with this dataset using ggraph (NOT the same way that you created it in part (b)).

  4. Create a basic network diagram of the Love Actually data using the ggraph package. Code is partially started for you below.

library(igraph)
library(ggraph)

names <- love_adjacency[,1]
graph <- graph_from_adjacency_matrix(as.dist(love_adjacency[,-1]))
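
As a warm-up for the network diagram, the following sketch builds a graph from a small made-up adjacency matrix and draws it with ggraph. The characters and counts are hypothetical, and the layout and geoms shown are just one reasonable starting point, not the required design.

```r
library(igraph)
library(ggraph)

# Hypothetical symmetric co-appearance counts for four made-up characters
adj <- matrix(c(0, 3, 1, 0,
                3, 0, 2, 0,
                1, 2, 0, 1,
                0, 0, 1, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# Undirected, weighted graph: each nonzero cell becomes an edge with a weight
toy_graph <- graph_from_adjacency_matrix(adj, mode = "undirected",
                                         weighted = TRUE, diag = FALSE)

# Basic network diagram: edge width shows weight, nodes are labeled by name
p <- ggraph(toy_graph, layout = "fr") +
  geom_edge_link(aes(edge_width = weight), alpha = 0.5) +
  geom_node_point(size = 4) +
  geom_node_text(aes(label = name), repel = TRUE) +
  labs(title = "Toy Character Network")
print(p)
```

Setting weighted = TRUE stores the counts as an edge attribute called weight, which is what makes it available to aes() in the edge layer.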
  1. (2 points each) Using the documentation at the link in parts (d), (e), and the ggraph GitHub page, make at least three adaptations to your graph from (f). For example, you might size the points, size the edges, use arcs (curved edges), use geom_edge_density, etc.

  2. (BONUS: 3 points) Color the nodes of the graph by the gender of the actor/actress. Facet on the gender of the actor/actress.



Problem 5

(4 points each)

Waffle Charts

  1. The code below creates a “Waffle Chart” with ggplot(). What is the purpose of a waffle chart? What would you use a waffle chart to visualize? (I.e. what type of data? How many dimensions/variables?)
#  Set up data to create the waffle chart
library(MASS)
data(Cars93)
var <- Cars93$Type  # the categorical variable you want to plot 
nrows <- 9  #  the number of rows in the resulting waffle chart
categ_table <- floor(table(var) / length(var) * (nrows*nrows))
temp <- rep(names(categ_table), categ_table)
df <- expand.grid(y = 1:nrows, x = 1:nrows) %>%
  mutate(category = sort(c(temp, sample(names(categ_table), 
                                        nrows^2 - length(temp), 
                                        prob = categ_table, 
                                        replace = T))))

#  Make the Waffle Chart
ggplot(df, aes(x = x, y = y, fill = category)) + 
  geom_tile(color = "black", size = 0.5) +
  scale_x_continuous(breaks = NULL) +
  scale_y_continuous(breaks = NULL) +
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Waffle Chart of Car Type",
       caption = "Source:  Cars93 Dataset", 
       fill = "Car Type",
       x = NULL, y = NULL) + 
  theme_bw()  #  Use your theme
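
The code above rounds each category's share of cells down with floor() and then fills the leftover cells by random sampling, so the chart can change slightly between runs. As a point of comparison, here is a sketch of a deterministic alternative (a largest-remainder allocation, which is a swapped-in technique and not what the provided code does). The counts are made up for illustration.

```r
# Hypothetical category counts (not taken from Cars93)
counts <- c(a = 5, b = 3, c = 2)
nrows <- 3
n_cells <- nrows^2

# Exact (fractional) number of cells each category would get
exact <- counts / sum(counts) * n_cells

# Round down first, as in the code above
cells <- floor(exact)

# Give the leftover cells to the categories with the largest fractional parts
leftover <- n_cells - sum(cells)
if (leftover > 0) {
  bump <- order(exact - cells, decreasing = TRUE)[seq_len(leftover)]
  cells[bump] <- cells[bump] + 1
}

print(cells)
```

In this toy case the exact shares are 4.5, 2.7, and 1.8 cells, so the two leftover cells go to c and b, giving 4, 3, and 2 cells.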

  1. Create a waffle chart for the content_rating variable in the imdb data from the lab exam. Use 25 rows. Then recreate the same graph, but use 50 rows. Which version of the chart do you prefer?

  2. Critique these graphs. What are the issues with waffle charts?



Problem 6

(1 point each)

Arc Pie Charts

Install and load the ggforce package. This package implements several updates and improvements to ggplot2.

  1. Create an “arc pie chart” of the Type variable in the Cars93 dataset. (Code provided.)
library(ggforce)
Cars93 %>% group_by(Type) %>% 
  summarize(count = n()) %>% 
  mutate(max = max(count),
         focus_var = 0.2 * (count == max(count))) %>%
  ggplot() + geom_arc_bar(aes(x0 = 0, y0 = 0, r0 = 0.8, r = 1, 
                              fill = Type, amount = count), 
                          stat = 'pie')

  1. Adjust the r0 parameter to lower and higher values. What does this control? What are its minimum and maximum values?

  2. Recreate the graph from (a), but this time, add explode = focus_var into your call to aes(). What does this do?

  3. Recreate the graph from (c), but this time, add focus to the category with the minimum number of observations.

  4. (4 points) Critique these graphs.

  • What would you use an arc pie chart to visualize? (I.e. what type of data? How many dimensions/variables?)
  • What are the issues with arc pie charts?
  • What are the issues with using explode to focus on a particular variable?


Problem 7

(5 points each)

Zoom Zoom

See the following code working with the IMDb movies dataset from Homework 7 for how to use facet_zoom().

library(tidyverse)
library(forcats)
library(devtools)
library(ggforce)

#  Colorblind-friendly color palette
my_colors <- c("#000000", "#56B4E9", "#E69F00", "#F0E442", "#009E73", "#0072B2", 
               "#D55E00", "#CC79A7")

#  Read in the data
imdb <- read_csv("https://raw.githubusercontent.com/mateyneykov/315_code_data/master/data/imdb_test.csv")

# get some more variables
imdb <- mutate(imdb, profit = (gross - budget) / 1000000,
               is_french = ifelse(country == "France", "Yes", "No")) %>%
  filter(movie_title != "The Messenger: The Story of Joan of Arc")
france_1990 <- filter(imdb, country == "France", title_year >= 1990)

# this code plots a scatterplot + a zoomed facet
ggplot(data = imdb, aes(x = title_year, y = profit)) + 
  geom_point(color = my_colors[1], alpha = 0.25) + 
  geom_smooth(color = my_colors[2]) + 
  geom_point(data = france_1990, color = my_colors[3]) + 
  geom_smooth(data = france_1990, aes(x = title_year, y = profit), 
              color = my_colors[4], method = lm) + 
  facet_zoom(x = title_year >= 1990) + 
  labs(title = "Movie Profits over Time",
       subtitle = "Zoom:  French Movies from 1990 -- 2017 (orange/yellow)",
       caption = "Data from IMDB and Kaggle",
       x = "Year of Release",
       y = "Profit (millions of USD)")

Also read the articles here, or here.

  1. Recreate any scatterplot that we created throughout the year, and zoom in on a section of the graph via the facet_zoom() feature in the newest version of the ggforce package. Include a title, subtitle, and caption in the resulting graph. The caption should just state the data source, and the subtitle should explain what area of the plot is being enhanced via zooming.

  2. Interpret the resulting graph: Describe some feature of the new version of the graph that you may not have been able to see very well in the previous version of the same graph (without zooming).